Thom Su

May 4, 2021

Project Overview


This capstone project is part of the Google Data Analytics Professional Certificate on Coursera. The project assumes the following scenario. Being a junior data analyst on the marketing team at Cyclistic, a bike-sharing company in Chicago, I was assigned to analyze how Cyclistic riders without annual membership, which is the group of interest for this project, use Cyclistic bikes differently than Cyclistic member riders. Cyclistic riders without annual membership are referred to as casual riders. The financial analysts in the company had concluded that annual members are more profitable than casual riders. The director of marketing believes that maximizing the number of annual members by converting casual riders into members is the right business strategy to ensure future growth and company’s success. Rather than employing a broad-based marketing campaign, the director of marketing is considering a targeted marketing strategy using social media.

Project Goals


I’ll be searching for patterns and trends in a variety of metrics that would show how casual riders usage of Cyclistic bike differ from member usage of Cyclistic bike. Some of the metrics to consider are:

Data Preparation


In this fictitious scenario, I get all my data internally but I’m refrained from using data that has any personal identifiable information. In reality, I’m provided with a public dataset online licensed by Motivate International. For this project, I’m using data that contains information about each ride, like time and location of the start and end of a bike trip, from April 2020 to March 2021.

Please note that Covid 19 pandemic began in March 2020 in the US and Covid 19 lockdown and/or restrictive measures are in place for most of if not the entire period. This include shutting down of businesses that are not essential services, limited restaurant dining seating and travel restrictions.

There are some pros and cons about using this data. One of the advantages is that since travel is fairly restricted, the tourist usage of Cyclistic bike should be minimal. Tourists are not interested in becoming an annual Cyclistic member. As there is no way to filter out based on personal information, using this data for analysis is ideal. However, how riders utilize Cyclistic bike during the pandemic may be different from how riders bike in normal times. Since the pandemic began, many people stopped commuting to work because they either lost their job or started working remotely. The patterns and trends from this data may not be representive of patterns and trends post-pandemic. This is something we need to keep in mind when looking at the results.

Let’s take a look at the first few rows of the data that we are using for this project.

Let’s familiarize the data by checking out the metadata summary.

Data summary
Name bikeshare_df
Number of rows 3489748
Number of columns 13
_______________________
Column type frequency:
character 7
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 16 16 0 3489539 0
rideable_type 0 1 11 13 0 3 0
started_at 0 1 19 19 0 3040228 0
ended_at 0 1 19 19 0 3027775 0
start_station_name 0 1 0 53 122175 709 0
end_station_name 0 1 0 53 143242 707 0
member_casual 0 1 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
start_station_id 83583 0.98 925.20 3046.76 0.00 76.00 188.00 324.00 20258.00 ▇▁▁▁▁
end_station_id 98104 0.97 924.20 3042.47 0.00 76.00 188.00 324.00 20258.00 ▇▁▁▁▁
start_lat 0 1.00 41.90 0.04 41.64 41.88 41.90 41.93 42.08 ▁▁▇▇▁
start_lng 0 1.00 -87.64 0.03 -87.87 -87.66 -87.64 -87.63 -87.52 ▁▁▂▇▁
end_lat 4738 1.00 41.90 0.04 41.54 41.88 41.90 41.93 42.16 ▁▁▇▅▁
end_lng 4738 1.00 -87.65 0.03 -88.07 -87.66 -87.64 -87.63 -87.44 ▁▁▁▇▁

There is so much useful information here. The table consist of 3,489,748 rows and 13 columns. 7 columns are of character data type and 6 are of numeric data type. None of the character type columns are unique. Upon further investigation, none of the columns are unique and therefore could not be used as a primary key for the SQL table. Although there are no missing values for character type columns, there are 122,175 empty strings found in start_station_name and 143,242 empty strings in end_station_id. As for numeric columns, start_station_id, end_station_id, end_lat, end_lng contain missing values.

Next, let’s check to see how many rows contain at least a missing value.

[1] "There are 135234 rows with missing values, excluding empty fields."

We are going to first check on missing values starting with the start_id column.

It is curious that the started_at column date begins from July 2020. The start_station_name column are empty while end_station_name and end_station_id contain some information.

I want to check out the empty strings at the start_station_name column to look for any patterns.

The started_at column date also begins in July, 2020. Both start_station_id and start_station_name contains no information. I noticed that the rideble_type are all electic_bike just like the previous result.

I want to see the same results for dates closest to the present.

The dates are in March, 2021. The rideable_type entries are only electric_bike. However, both start_station_id and end_station_id are 0 while start_station_name and end_station_name are empty strings.

I’m beginning to think that there is a relationship between electric_bike and the missing data.

Next, we are going to examine observations with start_station_id but with no information on start_station_name.

The rideable_type are still electric_bike but the date starts from December 2021. While the start_station_id are 0, the values in the start_lat and the start_lng are not unique, indicating that start_station_id is not just at one location.

Including start_station_id and end_station_id that are 0 can still be useful for some analyses while it can be problematic for other types of analyses. This is something that we also need to keep in mind.

Next, we are going to check out the results from dates closest to the present.

We are seeing the same pattern again: electric_bike in rideable_type, 0 in start_station_id, and start_lat and start_lng do not contain unique values.

We are going to examine the count of start_station_name that are not empty with start_station_id equal to 0.

This is further confirmation that station_id 0 is not at some specific location. Using either start_station_name or end_station_name would be a better choice for analysis than using start_station_id or end_station_id.

Let’s check to see if there are any start_station_id or end_station_id that are not 0 or missing and have empty string for the corresponding start_station_name or end_station_name.

If the start_station_id or end_station_id is not 0 or missing, then the corresponding start_station_name or end_station_name is not an empty string.

Next, We’ll check if there are any rows where the rideable_type is not electric_bike and the start_station_name or end_station_name is an empty string.

The results are not what I expected. While there are no rows with classic_bike or docked_bike and empty start_station_name, there are rows with bikes that are not electric_bike containing empty end_station_name. Curiously enough, the end_lat and end_lng are also missing but not ended_at for rows with empty end_station_name. I don’t know what to make of this finding.

Next, let’s check which start_station_id have more than one start_station_name.

There are 23 start_station_id with 2 different names.

Let’s check our the start_station_name for these cases.

It appears that some of these stations were moved to a different location while others that were labeled (Temp) or (*) have those labels removed.

We are going to look at start_station_name with more than one corresponding start_station_id.

There are 279 start_station_name with two different start_station_id.

Let’s find out the start_station_id for these cases.

These stations have one thing in common. One of the station is 2-3 digits and the other is 5 digits.

Let’s check to see if there are any rows where start_station_name and end_station_name are the same.

It maybe worth examining these round trips in more detail later.

Engineering New Features


Let’s create some features based on started_at.

We now create two new features about the trip: trip_duraton and trip_distance.

Let’s now check out the histogram of the trip_duration that we just created.

The distribution is skewed to the right and it appears that there are some trips with negative trip_duration.

My first intuition is that the negative values in trip_duration is the result of time change that happens twice a year. The results above show many different dates and this is clearly bad data.

We are going to remove the rows of bad data.

In additional, we’ll remove trips that are less than five minutes where the start_station_name and the end_station_name are the same.

Data Visualization


The first graph we are going to build will show member and casual riders usage of the different bike type.

Overall, docked_bike has the highest number of trip of the three different bike types. It is true for both groups of riders. However, the initial exploration of data hints that electric_bike may have been a new addition to the bike fleet. The empty start_station_name, end_station_name and missing start_station_id, end_station_id may have been due to the addition of new bike stations to accommodate new bikes.

Next, let’s build a bar chart to show the average trip duration of member and casual riders for the three different bike types.

While the average trip duration is roughly the same for member riders at about 15 minutes, the average time casual riders spend on the three different types of bike varies much more. On average, casual riders spend the most time on docked_bike, 52 minutes, and least on electric_bike, 22 minutes.

Let’s now check out how the total number of trips vary by the day of week.

There are a few Saturdays during the twelve month period with daily trip totals higher than any other day of the week. Tuesday, Wednesday and Thursday have similar distribution of daily trip totals.

We are now constructing the same plot for just the member group.

Overall, the distribution of the daily trip totals for the different days of the week looks similar with just a slight variation.

Here is the plot for the casual riders.

With the exception on one Monday, there are a few Fridays, Saturdays and Sundays that is higher than the rest of the week. Casual riders are more likely to bike on weekends.

The following chart breaks down the daily count of trips by hour of day.

Most of the trips occur in the afternoon and early evening. The hour between 5pm - 6pm has the highest average count.

This is the hourly breakdown for member group.

Now, we look at the chart for casual riders and compare the two groups.

Although, the two charts look relatively similar, there are subtle differences. There are more casual riders than members who ride the bike late evenings and more member riders than casual riders who bike during the morning commute between 6am - 9am.

Here is a look at the histogram of the duration of trips by the two groups.

This plot confirms that casual riders are more likely to take longer trips than members.

We are now going to create a plot of the average distance travelled by the two groups for each month during the one year period. We’ll not be including any trips where the starting station is the same as the ending station.

For all trips where the starting station and ending station are different, casual riders on average travel farther than member riders. Both member and casual riders bike shorter distances between December to February.

Let’s create a map of all the locations of Cyclistic bike stations.

The following are two maps, side by side, of the 50 top bike stations with the most casual (left) or member (right) riders starting the trip.

The bike stations are generally in the same area, downtown Chicago and neighborhoods like Lincoln Park and Lakeview on the Northside that are next to Lake Michigan. Notice the size of the circles for casual riders are generally smaller than member for bike stations on the Northside. This indicates that more member riders start the trip at these bike stations than casual riders.

Here is a graph of the names of the top 50 bike stations with the most members starting the trip.

Now, this is a graph of the names of the top 50 bike stations with the most casual riders starting the trip.

The top member starting station, Clark St. & Elm St. is in the 8th place on the top casual starting station. The top casual starting station, Streeter Dr. & Grand Ave. is 50th on the top member starting station. The number of trips starting at the stations falls off quickly for casual riders than for members.

Now, let’s examine the maps of the 50 top stations that have the most casual (left) and member (right) riders ending the bike trip.

The map looks very similar to that of starting bike stations.

We are going to create a plot comparing the top 50 ending member bike stations to the top 50 ending casual bike stations.

Similar observations of the top 50 starting station names are seen here.

Let’s create a table of the ten most frequent routes by the hour for each group. We’ll be leaving out bike trips that start and end at the same station name.

It’s time to create something exciting, an animated plot of the ten most frequent bike routes for the different hours of the day for the two groups.

The top 10 most frequent routes for casual riders between 10am - 8pm are in downtown Chicago and along North Lake Shore Drive.

Let’s create a new table of the top 20 stations with the highest combined total count of trips starting and ending at that location.

This time, we are building an interactive chord diagram of the top 20 stations for the member group that defines the route between pairs of stations as a relationship or an edge of a graph. We are going to examine these connections to look for patterns.

Let me quickly walk you through a few concepts of a chord diagram. Each station is represented by different colors. Simply click or hover on the station you would like to examine and all the routes with the station will be highlighted. The thickness of the ribbon represents the combined bidirectional count of trips between two stations and the color of the ribbon represents the color of the station with the larger count of trips ending there. If you hover on a ribbon, an infobox will pop up showing the total trip counts for both directions.

Take a few minutes to study the chord diagram above.

Next, we are going to create a chord diagram of the top 20 stations for casual riders and compare the diagrams for insights.

By comparing the two chord diagrams, we notice that there are a few large ribbon ‘humps’ in the diagram representing casual riders. This indicates that there are a large number of riders who started and ended the trip at the same station. We also can clearly see that the thickest ribbon between two different stations can be found on the casual rider chord diagram between Streeter Dr. & Grand Ave. and Lake Shore Dr. & Monroe St..

Finally, let’s create a chart of the top bidirectional routes excluding trips starting and ending at the same station for each month of the one year period.

The most frequent monthly routes are highest in July and August 2020 for casual riders. Top casual bike stations, like Streeter Dr, & Grand Ave., Millennium Park, Theater on the Lake and Lake Shore Dr. & Monroe St. have huge numbers during the warmer months, but these stations are hardly on the list in the colder months.

Piecing Together the Big Picture


Casual riders do sometimes bike more on weekends than on weekdays. They spend more time with the bike on average and travel on average farther than member riders. However, there are indications that they are not primarily using the bike as means of transport to get from point A to point B. While afternoon is the time of day that most casual riders ride Cyclistic bikes, in general more casual riders bike at night than members. Casual riders are more likely to use the bike for leisure activities and will often return the bike to the same station where they started.

While there is a pattern that most casual riders bike around downtown Chicago and along Lake Shore Drive on the Northside, there is not much evidence that the casual riders live in the area. A more likely explanation is that they are from other neighborhoods in the city or the suburbs and they came for an outing by the lake to enjoy the outdoors. Given that there were not many leisure activity options available due to the pandemic, many people chose to enjoy a beautiful summer weekend biking outdoors instead of staying at home. Tourist bike usage should be minimal during this timeframe because travel was fairly restrictive due to the pandemic.

Actionable Recommendations


As casual and member bike usage are quite dissimilar, tweaking the benefits of becoming a member may induce more people wanting to join. Because casual riders bike longer on average and are more likely to bike on weekends, perhaps increasing the amount of time members are allow to have a bike on weekends would provide more incentive to join.

Our target audience for a social media campaign is anyone who spent some time on weekends in areas where the most frequent casual stations are located. They may have post messages or photos about the time spent in the area. They should live in Chicago and the suburbs. The message of the campaign should be about the health and leisure aspect of biking and the benefits of becoming a member.

Thanks for reading!!!


 

Created from the heart by Thom Su

thomsu.nyc@gmail.com